
    Sample-Efficient Model-Free Reinforcement Learning with Off-Policy Critics

    Value-based reinforcement-learning algorithms provide state-of-the-art results in model-free discrete-action settings, and tend to outperform actor-critic algorithms. We argue that actor-critic algorithms are limited by their need for an on-policy critic. We propose Bootstrapped Dual Policy Iteration (BDPI), a novel model-free reinforcement-learning algorithm for continuous states and discrete actions, with an actor and several off-policy critics. Off-policy critics are compatible with experience replay, ensuring high sample-efficiency, without the need for off-policy corrections. The actor, by slowly imitating the average greedy policy of the critics, leads to high-quality and state-specific exploration, which we compare to Thompson sampling. Because the actor and critics are fully decoupled, BDPI is remarkably stable, and unusually robust to its hyper-parameters. BDPI is significantly more sample-efficient than Bootstrapped DQN, PPO, and ACKTR, on discrete, continuous and pixel-based tasks. Source code: https://github.com/vub-ai-lab/bdpi. Comment: Accepted at the European Conference on Machine Learning 2019 (ECML).
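
    A minimal sketch of the actor-update mechanism described above (the actor slowly imitating the average greedy policy of several off-policy critics). The tabular critics, the imitation rate and all shapes are illustrative assumptions, not the BDPI implementation:

        import numpy as np

        # Toy illustration: an actor that slowly moves towards the average greedy
        # policy of several off-policy critics. Sizes and learning rate are assumed.
        n_states, n_actions, n_critics = 10, 4, 8
        actor_lr = 0.05                                  # how "slowly" the actor imitates

        rng = np.random.default_rng(0)
        actor = np.full((n_states, n_actions), 1.0 / n_actions)      # stochastic policy
        critics = rng.normal(size=(n_critics, n_states, n_actions))  # toy Q-tables

        def actor_update(actor, critics, lr=actor_lr):
            """Move the actor a small step towards the critics' average greedy policy."""
            greedy = np.zeros_like(actor)
            for q in critics:
                greedy[np.arange(n_states), q.argmax(axis=1)] += 1.0 / len(critics)
            return (1.0 - lr) * actor + lr * greedy      # slow imitation step

        actor = actor_update(actor, critics)
        assert np.allclose(actor.sum(axis=1), 1.0)       # rows remain probability distributions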

    A framework for reinforcement learning with autocorrelated actions

    The subject of this paper is reinforcement learning. We consider policies that produce actions based on states and on random elements that are autocorrelated across subsequent time instants. Consequently, an agent learns from experiments that are distributed over time and potentially give better clues for policy improvement. Physical implementation of such policies, e.g. in robotics, is also less problematic, as it avoids making robots shake; this contrasts with most RL algorithms, which add white noise to the control signal and thereby cause unwanted shaking of the robots. An algorithm is introduced that approximately optimizes such policies. Its efficiency is verified on four simulated learning control problems (Ant, HalfCheetah, Hopper, and Walker2D) against three other methods (PPO, SAC, ACER). The algorithm outperforms the others on three of these problems. Comment: The 27th International Conference on Neural Information Processing (ICONIP 2020).
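
    To illustrate the idea of random elements autocorrelated over time, the sketch below replaces white exploration noise with first-order autoregressive (AR(1)) noise, so consecutive actions vary smoothly. The correlation coefficient, noise scale and the stand-in policy are assumptions, not the paper's exact construction:

        import numpy as np

        rng = np.random.default_rng(0)
        alpha, sigma, action_dim = 0.9, 0.2, 6           # high alpha => smooth action sequences

        def mu(state):
            """Stand-in deterministic policy mean (a neural network in practice)."""
            return np.tanh(state[:action_dim])

        noise = np.zeros(action_dim)
        state = rng.normal(size=8)
        for t in range(5):
            # AR(1): today's random element depends on yesterday's, so actions drift
            # smoothly instead of jittering (no robot shaking).
            noise = alpha * noise + np.sqrt(1 - alpha**2) * sigma * rng.normal(size=action_dim)
            action = mu(state) + noise
            state = rng.normal(size=8)                   # toy environment transition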

    The Emergence of Norms via Contextual Agreements in Open Societies

    This paper explores the emergence of norms in agent societies when agents simultaneously play multiple, even incompatible, roles in their social contexts and have limited interaction ranges. Specifically, this article proposes two reinforcement learning methods for agents to compute agreements on strategies for using common resources to perform joint tasks. The computation of norms when agents play multiple roles in their social contexts has not been studied before. To make the problem even more realistic for open societies, we do not assume that agents share knowledge of their common resources, so they have to compute semantic agreements towards performing their joint actions. The paper reports on an empirical study of whether and how efficiently societies of agents converge to norms, exploring the proposed social learning processes with respect to different society sizes and the ways agents are connected. The reported results are very encouraging regarding both the speed of the learning process and the convergence rate, even in quite complex settings.
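
    A toy sketch of norm emergence through social reinforcement learning in the spirit described above: randomly paired agents pick a strategy for a common resource, are rewarded when they agree, and update Q-values until one convention dominates. The pairing scheme, payoffs and hyper-parameters are illustrative assumptions, not the paper's methods:

        import random

        N_AGENTS, STRATEGIES, EPISODES = 20, ["left", "right"], 2000
        ALPHA, EPSILON = 0.1, 0.1
        q = [{s: 0.0 for s in STRATEGIES} for _ in range(N_AGENTS)]

        def choose(agent):
            if random.random() < EPSILON:                # occasional exploration
                return random.choice(STRATEGIES)
            return max(q[agent], key=q[agent].get)

        for _ in range(EPISODES):
            a, b = random.sample(range(N_AGENTS), 2)     # limited, pairwise interaction
            sa, sb = choose(a), choose(b)
            reward = 1.0 if sa == sb else -1.0           # agreement on resource use pays off
            q[a][sa] += ALPHA * (reward - q[a][sa])
            q[b][sb] += ALPHA * (reward - q[b][sb])

        norm = [max(qi, key=qi.get) for qi in q]
        dominant = max(set(norm), key=norm.count)
        print(dominant, norm.count(dominant))            # emergent norm and how many agents adopted it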

    Deep Reinforcement Learning: An Overview

    In recent years, a specific machine learning method called deep learning has gained huge attention, as it has obtained astonishing results in broad applications such as pattern recognition, speech recognition, computer vision, and natural language processing. Recent research has also shown that deep learning techniques can be combined with reinforcement learning methods to learn useful representations for problems with high-dimensional raw data input. This chapter reviews recent advances in deep reinforcement learning, with a focus on the most used deep architectures, such as autoencoders, convolutional neural networks and recurrent neural networks, which have been successfully combined with the reinforcement learning framework. Comment: Proceedings of SAI Intelligent Systems Conference (IntelliSys) 201
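
    As a minimal sketch of the combination this chapter surveys, the code below uses a convolutional network as a Q-function over raw pixel input and trains it with a one-step TD target. The architecture, the 84x84 input and all hyper-parameters are illustrative assumptions (PyTorch is used only for convenience):

        import torch
        import torch.nn as nn

        class ConvQNet(nn.Module):
            """Convolutional Q-network mapping stacked frames to action values."""
            def __init__(self, n_actions=4):
                super().__init__()
                self.features = nn.Sequential(
                    nn.Conv2d(4, 16, kernel_size=8, stride=4), nn.ReLU(),
                    nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(), nn.Flatten(),
                )
                self.head = nn.Linear(32 * 9 * 9, n_actions)

            def forward(self, x):
                return self.head(self.features(x))

        qnet = ConvQNet()
        opt = torch.optim.Adam(qnet.parameters(), lr=1e-4)

        s = torch.rand(8, 4, 84, 84)                     # batch of stacked frames
        a = torch.randint(0, 4, (8,))
        r, s2, gamma = torch.rand(8), torch.rand(8, 4, 84, 84), 0.99

        with torch.no_grad():
            target = r + gamma * qnet(s2).max(dim=1).values   # bootstrapped TD target
        q_sa = qnet(s).gather(1, a.unsqueeze(1)).squeeze(1)
        loss = nn.functional.mse_loss(q_sa, target)
        opt.zero_grad()
        loss.backward()
        opt.step()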

    Learning flexible sensori-motor mappings in a complex network

    Given the complex structure of the brain, how can synaptic plasticity explain the learning and forgetting of associations when these are continuously changing? We address this question by studying different reinforcement learning rules in a multilayer network in order to reproduce monkey behavior in a visuomotor association task. Our model can only reproduce the learning performance of the monkey if the synaptic modifications depend on the pre- and postsynaptic activity, and if the intrinsic level of stochasticity is low. This favored learning rule is based on reward-modulated Hebbian synaptic plasticity and shows the interesting feature that the learning performance does not substantially degrade when adding layers to the network, even for a complex problem.
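
    A small sketch of a reward-modulated Hebbian rule of the kind favored above, in which the weight change depends on pre- and postsynaptic activity and is gated by reward, with a low level of intrinsic noise. Layer sizes, the reward baseline and the learning rate are illustrative assumptions, not the paper's model:

        import numpy as np

        rng = np.random.default_rng(0)
        n_pre, n_post, lr = 50, 10, 0.01
        w = rng.normal(scale=0.1, size=(n_post, n_pre))
        reward_baseline = 0.0

        def trial(w, reward_baseline, stimulus, correct_action):
            pre = stimulus                               # presynaptic activity
            post = np.tanh(w @ pre)                      # postsynaptic activity
            noisy = post + 0.05 * rng.normal(size=n_post)    # low intrinsic stochasticity
            reward = 1.0 if int(np.argmax(noisy)) == correct_action else 0.0
            # Three-factor rule: (reward - baseline) x postsynaptic x presynaptic
            w = w + lr * (reward - reward_baseline) * np.outer(post, pre)
            reward_baseline += 0.05 * (reward - reward_baseline)
            return w, reward_baseline, reward

        stimulus = rng.normal(size=n_pre)
        w, reward_baseline, r = trial(w, reward_baseline, stimulus, correct_action=3)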

    Learning Shapes Spontaneous Activity Itinerating over Memorized States

    Learning is a process that helps create neural dynamical systems so that an appropriate output pattern is generated for a given input. Often, such a memory is considered to be included in one of the attractors of the neural dynamical system, depending on the initial neural state specified by an input. Neither the neural activity observed in the absence of inputs nor the changes caused in the neural activity when an input is provided were studied extensively in the past. However, recent experimental studies have reported the existence of structured spontaneous neural activity and its changes when an input is provided. With this background, we propose that memory recall occurs when the spontaneous neural activity changes to an appropriate output activity upon the application of an input; this phenomenon is known as bifurcation in dynamical systems theory. We introduce a reinforcement-learning-based layered neural network model with two synaptic time scales; in this network, I/O relations are successively memorized when the difference between the time scales is appropriate. After the learning process is complete, the neural dynamics are shaped so that they change appropriately with each input. As the number of memorized patterns is increased, the spontaneous neural activity generated after learning shows itineration over the previously learned output patterns. This theoretical finding also shows remarkable agreement with recent experimental reports in which spontaneous neural activity in the visual cortex, without stimuli, itinerates over patterns evoked by previously applied signals. Our results suggest that itinerant spontaneous activity can be a natural outcome of successive learning of several patterns, and that it facilitates bifurcation of the network when an input is provided.
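
    A toy sketch of a synapse with two time scales in the spirit of the model above: a fast component that adapts quickly but decays, and a slow component that consolidates the learned input/output relations. The decay constant, learning rates and the single-layer setting are illustrative assumptions, not the model's values:

        import numpy as np

        rng = np.random.default_rng(0)
        n_in, n_out = 20, 5
        w_fast = np.zeros((n_out, n_in))
        w_slow = rng.normal(scale=0.1, size=(n_out, n_in))
        eta_fast, eta_slow, decay_fast = 0.1, 0.001, 0.95    # fast adapts much faster than slow

        def step(x, reward_signal):
            global w_fast, w_slow
            y = np.tanh((w_fast + w_slow) @ x)           # effective weight is the sum
            hebb = reward_signal * np.outer(y, x)        # reward-modulated Hebbian term
            w_fast = decay_fast * w_fast + eta_fast * hebb   # quickly learned, quickly forgotten
            w_slow = w_slow + eta_slow * hebb            # slowly consolidated memory
            return y

        y = step(rng.normal(size=n_in), reward_signal=1.0)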

    Particle Swarm Optimization with Reinforcement Learning for the Prediction of CpG Islands in the Human Genome

    BACKGROUND: Regions with abundant GC nucleotides, a high CpG number, and a length greater than 200 bp in a genome are often referred to as CpG islands. These islands are usually located at the 5' end of genes. Recently, several algorithms for the prediction of CpG islands have been proposed. METHODOLOGY/PRINCIPAL FINDINGS: We propose here a new method called CPSORL to predict CpG islands, which consists of a complement particle swarm optimization algorithm combined with reinforcement learning to predict CpG islands more reliably. Several CpG island prediction tools equipped with the sliding window technique have been developed previously. However, the quality of their results seems to rely too heavily on the choice of window size, and thus these methods leave room for improvement. CONCLUSIONS/SIGNIFICANCE: Experimental results indicate that CPSORL achieves higher sensitivity and a higher correlation coefficient in all selected experimental contigs than the other methods it was compared to (CpGIS, CpGcluster, CpGProd and CpGPlot). A higher number of CpG islands were identified in chromosomes 21 and 22 of the human genome than with the other methods from the literature. CPSORL also achieved the highest coverage rate (3.4%). CPSORL is an application for identifying promoter and TSS regions associated with CpG islands in the entire human genome. When compared to CpGcluster, the islands predicted by CPSORL covered larger TSS (12.2%) and promoter (26.1%) regions. If Alu sequences are considered, the islands predicted by CPSORL (Alu) covered larger TSS (40.5%) and promoter (67.8%) regions than CpGIS. Furthermore, CPSORL was used to verify that the average methylation density is 5.33% for CpG islands in the entire human genome.
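
    For reference, a sketch of the window-based CpG island criterion the abstract alludes to (GC-rich, many CpG dinucleotides, length greater than 200 bp). The 50% GC-content and 0.6 observed/expected thresholds are the commonly used values and are an assumption here, not necessarily CPSORL's settings; CPSORL replaces the fixed sliding window with a PSO/RL search over candidate regions:

        def is_cpg_island(seq: str) -> bool:
            """Check one candidate window against the classic CpG island criteria."""
            seq = seq.upper()
            n = len(seq)
            if n < 200:                                  # minimum island length
                return False
            g, c = seq.count("G"), seq.count("C")
            cpg = seq.count("CG")                        # observed CpG dinucleotides
            gc_content = (g + c) / n
            expected_cpg = (g * c) / n if g and c else 0.0
            obs_exp = cpg / expected_cpg if expected_cpg else 0.0
            return gc_content >= 0.5 and obs_exp >= 0.6

        window = "CG" * 120                              # toy 240 bp CpG-rich sequence
        print(is_cpg_island(window))                     # True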

    Human–agent collaboration for disaster response

    In the aftermath of major disasters, first responders are typically overwhelmed with large numbers of spatially distributed search and rescue tasks, each with their own requirements. Moreover, responders have to operate in highly uncertain and dynamic environments where new tasks may appear and hazards may be spreading across the disaster space. Hence, rescue missions may need to be re-planned as new information comes in, tasks are completed, or new hazards are discovered. Finding an optimal allocation of resources to complete all the tasks is a major computational challenge. In this paper, we use decision-theoretic techniques to solve the task allocation problem posed by emergency response planning and then deploy our solution as part of an agent-based planning tool in real-world field trials. By so doing, we are able to study the interactional issues that arise when humans are guided by an agent. Specifically, we develop an algorithm based on a multi-agent Markov decision process representation of the task allocation problem and show that it outperforms standard baseline solutions. We then integrate the algorithm into a planning agent that responds to requests for tasks from participants in a mixed-reality location-based game, called AtomicOrchid, that simulates disaster response settings in the real world. We then run a number of trials of our planning agent and compare it against a purely human-driven system. Our analysis of these trials shows that human commanders adapt to the planning agent by taking on a more supervisory role, and that providing humans with the flexibility of requesting plans from the agent allows them to perform more tasks more efficiently than using purely human interactions to allocate tasks. We also discuss how such flexibility could lead to poor performance if left unchecked.
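
    A heavily simplified sketch of the decision-theoretic task-allocation idea: the state is the set of outstanding rescue tasks, an action picks which task to attempt next, and dynamic programming computes the value of each state. Task rewards, success probabilities and the single-team setting are illustrative assumptions; the paper's multi-agent Markov decision process is considerably richer:

        from itertools import combinations

        tasks = {"medical": (10.0, 0.9), "fire": (7.0, 0.6), "search": (5.0, 0.8)}   # reward, p(success)
        gamma = 0.95

        value = {frozenset(): 0.0}
        for k in range(1, len(tasks) + 1):               # solve smaller task sets first
            for subset in combinations(tasks, k):
                s = frozenset(subset)
                best = 0.0
                for t in s:
                    reward, p_success = tasks[t]
                    # Succeeding yields the reward; either way the attempt consumes
                    # the task, so the next state drops it (a simplifying assumption).
                    best = max(best, p_success * reward + gamma * value[s - {t}])
                value[s] = best

        print(value[frozenset(tasks)])                   # value of tackling the full mission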